1. Business case

The Covid-19 pandemic created new incentives for city dwellers to move from condos to detached houses, either within or outside Toronto.

In this context, we propose a simple tool for exploring the property transactions that have taken place in Toronto, with two purposes:

  1. estimating the market value of a property in Toronto given its location and other user-defined features

  2. and, given a budget and other constraints, visualizing the properties sold.

At the moment, given the project time constraint, although we were able to web scrape the latest available listings on Zoocasa, we did not have sufficient time to clean and process the data. Instead, to illustrate the business case solution, we used a pre-existing dataset, available as of H2 2019, which was already cleaned and processed.

2. Project data

2.1 Data sources

The current project builds upon a pre-existing project on Toronto housing price prediction available at https://github.com/slavaspirin/Toronto-housing-price-prediction

The dataset used from the pre-existing project is available at https://github.com/slavaspirin/Toronto-housing-price-prediction/raw/master/houses_edited.csv

The data from the pre-existing project is based on web scraping of Zoocasa listings of previously sold properties. Unfortunately, the data does not have a time stamp. We understand that the primary listing data was scraped from https://www.zoocasa.com and contains a list of sold properties available as of some time in H2 2019.

We were able to scrape a scoring of Toronto neighborhoods from https://torontolife.com/neighbourhood-rankings/ to complement the average 2016 personal income data of each district that was already available in the pre-existing project.

Toronto neighborhoods data for geographical mapping was available from https://open.toronto.ca/dataset/neighbourhoods/.

We were also able to obtain location data for the Toronto subway stations from https://scruss.com/blog/2005/12/14/toronto-subway-station-gps-locations/#comments. For the line currently under construction, Line 5 Eglinton, the latitude and longitude of the subway stations were obtained from https://en.wikipedia.org/wiki/Line_5_Eglinton.

2.2 Data description

Zoocasa listings in the pre-existing project dataset

The data available includes 15234 listings of Toronto properties with the following available features.

Variable Name | Description
title | text, Zoocasa short description of the listing
final price | numeric, sale price
listed price | numeric, listed price
bedrooms | text ordinal, 0 beds, 0 + 1 beds, 1 beds … 9 + 5 beds
bedrooms>grade | numeric, number of bedrooms above grade
bedrooms<grade | numeric, number of bedrooms below grade
bathrooms | numeric, 1 to 11
sqft | numeric, between 259 and 4374, or missing
description | text, Zoocasa long description of the listed property
mls | text, Zoocasa identifier
type | text categorical, Att/Row/Twnhouse, Comm Element Condo, Condo Apt, Condo Townhouse, Co-Op Apt, Co-Ownership Apt, Detached, Link, Plex, Semi-Detached, Store W/Apt/Offc
full link | text, Zoocasa web link
lat | numeric, property location latitude
long | numeric, property location longitude
city district | text, Toronto city district
district code | numeric, Toronto city district identifier code
mean district income | numeric, Toronto city district average household income based on 2016 statistics

Based on “bedrooms>grade” and “bedrooms<grade” we created an aggregated bedrooms feature, calculated as “bedrooms>grade” + “bedrooms<grade” / 2, to account for the smaller size of below-grade bedrooms.

Based on “listed price” and “final price” we created a “price differential” feature, calculated as “final price” / “listed price” - 1. Even though such a feature is not necessarily useful for predicting the sale price, it is informative about the pricing error that property sellers have encountered, which sellers may take into account when listing a property.
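The two derived features described above can be sketched in a few lines. This is only an illustration of the formulas in the text; the function names and the example listing are hypothetical and not taken from the project code.

```python
def aggregate_bedrooms(above_grade: float, below_grade: float) -> float:
    """Bedrooms above grade plus half of the bedrooms below grade."""
    return above_grade + below_grade / 2


def price_differential(final_price: float, listed_price: float) -> float:
    """Relative gap between the sale price and the listed price."""
    return final_price / listed_price - 1


# Hypothetical listing: 3 bedrooms above grade, 1 below grade,
# listed at 800,000 and sold at 840,000.
bedrooms_agg = aggregate_bedrooms(3, 1)
differential = price_differential(840_000, 800_000)
print(bedrooms_agg, round(differential, 4))
```

A positive differential means the property sold above its listed price, a negative one that it sold below.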

Toronto subway - walking distance to the closest subway station

Based on location data for the Toronto subway stations (including Line 5 Eglinton), we were able to identify the closest subway station to each property and estimate the walking distance to it in minutes, assuming an average walking speed of 5 km/h.
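One possible way to compute this feature is sketched below. It assumes a straight-line (haversine) distance between the property and each station, which is an assumption on our side: the text does not state which distance measure was used. The station names and coordinates are placeholders, not real station data.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean Earth radius
WALKING_SPEED_KMH = 5.0    # average walking speed stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_station(lat, lon, stations):
    """Return (name, distance_km, walking_minutes) for the nearest station."""
    name, (s_lat, s_lon) = min(
        stations.items(),
        key=lambda item: haversine_km(lat, lon, item[1][0], item[1][1]),
    )
    distance_km = haversine_km(lat, lon, s_lat, s_lon)
    return name, distance_km, distance_km / WALKING_SPEED_KMH * 60


# Placeholder stations and property location (hypothetical coordinates).
stations = {"Station A": (43.65, -79.38), "Station B": (43.70, -79.40)}
name, distance_km, minutes = closest_station(43.66, -79.38, stations)
print(name, round(distance_km, 2), round(minutes, 1))
```

A straight-line distance understates the real walking route, so the resulting minutes should be read as a lower bound.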

Toronto neighbourhood rankings

Variable Name | Description
district code | numeric, Toronto city district identifier code
area name | text, Toronto city district name
description | text, description of the district
housing score | numeric, score based on affordability (cost vs. income), appreciation (yoy change) and rate of home ownership
safety score | numeric, score based on number of crimes
transit score | numeric, score based on number of TTC stops, walk and transit scores, commuting times, numbers of commuters who walk, cycle or take TTC
shopping score | numeric, score based on number of groceries, markets and pharmacies per km2
health score | numeric, score based on number of medical and mental health services per capita, number of senior care services per senior, number of people with family doctors and physical activity levels among residents
entertainment score | numeric, score based on number of gyms, sport facilities, bars and restaurants per km2
community score | numeric, score based on voter turnout, community space use per capita, how many people report a sense of community belonging
education score | numeric, score based on number of schools per child, number of daycares per child, share of residents with post-secondary education
diversity score | numeric, score based on % of visible minorities, people whose mother tongues are not French or English, and first- and second-generation immigrants
employment score | numeric, score based on employment and unemployment rates, the share of residents below the poverty line, the share of high-income residents and the share of self-employed residents

Distance to closest subway

2.3 Data preparation

The data pertaining to housing listings was already cleaned and aggregated, and is readily available at https://github.com/slavaspirin/Toronto-housing-price-prediction/raw/master/houses_edited.csv

2.4 Descriptive statistics

2.4.1. Property sale price distribution

The histograms of sale prices and log sale prices for condos, townhouses, detached and semi-detached properties (this inclusion criterion captures 94% of our listings) indicate that the log transform shifts the distribution closer to a Gaussian one. This will allow us to better model the listings with prices around the mode of the distribution and below.

One can also observe that the sale price relative to the listed price has a positively skewed distribution when above zero, meaning that sellers were getting more than the listed price. Nevertheless, the very high values observed (50% mark-up) make us less inclined to use the listed price in our analysis. It is possible that the listed price is biased by sellers seeking to increase the marketability of the property and gather more offers during the listing period.
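The effect of the log transform can be illustrated on synthetic data. The sketch below draws right-skewed "prices" from a log-normal distribution and compares the sample skewness before and after taking logs. The parameters are arbitrary and the sample is not the project data.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)  # fixed seed so the illustration is reproducible
# Synthetic, right-skewed prices (arbitrary log-normal parameters).
prices = [rng.lognormvariate(13.5, 0.5) for _ in range(5000)]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(round(raw_skew, 2), round(log_skew, 2))
```

A skewness close to zero after the transform is consistent with a roughly symmetric, Gaussian-like shape, while the raw prices keep a long right tail.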

2.4.2. Property features characteristics

An interesting property of condos relates to the subway walking distance feature. Condos are concentrated within 39 minutes of the closest subway station: the corresponding histogram drops abruptly around the 39-minute threshold.

2.4.3. Toronto districts descriptive statistics

The Toronto district scores seem to be designed to be uniformly distributed, in contrast with the district average income which seems to be non-uniform.

Because the district average income has a different distribution than the district scores, we computed the Spearman (rank) correlations. One can observe the following strong correlations:

  1. the diversity score is strongly negatively correlated with the employment score and the average district income,
  2. the employment score is strongly positively correlated with the average district income and
  3. transit, shopping and entertainment scores are highly correlated.
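The reason for preferring a rank correlation here can be shown with a small sketch. The scores and incomes below are made-up values chosen so that income grows monotonically but not linearly with the score; they are not taken from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical district score (evenly spread) and income (heavily skewed).
score = [1, 2, 3, 4, 5, 6, 7, 8]
income = [30, 32, 35, 40, 50, 70, 120, 300]
pearson_r = pearson(score, income)
spearman_r = spearman(score, income)
print(round(pearson_r, 3), round(spearman_r, 3))
```

Because the relationship is perfectly monotonic, the rank correlation is at its maximum, while the Pearson coefficient is pulled down by the skewed income values.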

We have also computed the Pearson correlations and observed that the most extreme value was 0.83 (shopping vs. entertainment scores). As such, these correlations would not pose problems in a linear regression (i.e. the X’X matrix is invertible).

2.4.4. Correlation - sale price and property features vs. Toronto’s district scores and average income

We have computed the Pearson correlations to observe how all property and location features are interconnected. As expected, the size of the property influences the sale price. In terms of district location, the district average income and the scores for safety, diversity and employment are also significantly correlated with the sale price (at the 1% significance level).

Unexpectedly, the “subway walking distance” has a small negative correlation with the final sale price. One of the reasons could be that this relationship differs across types of properties. For condos, subway closeness is much more important than for detached houses. For detached houses, the size of the property, and implicitly the price, is higher the further away one gets from the subway lines.

We note that the most extreme correlation values are below 0.85 (excluding the “bedroom>grade” vs “Bedrooms Agg” correlation). As such, these correlations would not pose problems in a linear regression (i.e. the X’X matrix is invertible).

2.4.5. Sale price distribution in relation with various features

As expected, there is a clear dependency between the type of house and the sale price. Condos have a price distribution centered at a lower level than Detached, Plex and Semi-Detached.

As expected, there is a clear increasing relationship between the number of bedrooms and the sale price.

As expected, there is a clear increasing relationship between the number of bathrooms and the sale price.

There is a clear increasing relationship between the surface of the property and the sale price. One can observe that many of the properties sold have missing square footage data, which will need to be imputed.

There is no clear relationship between the distance to the closest subway and the sale price. This might be due to an uneven mix of property types, which depends on subway proximity. For condos, subway closeness is much more important than for detached houses. For detached houses, the size of the property, and implicitly the price, is higher the further away one gets from the subway lines.

There is a clear increasing relationship between the district wealth level and the sale price.

Although there is no clear relationship between the neighbourhood transit score and the average level of the sale price, the shape of the sale price distribution varies visibly, which might be related to different mixes of property types, depending on the district and the transit network.

There is no clear relationship between the district shopping score and the sale price.

Although there is no clear relationship between the neighbourhood healthcare score and the average level of the sale price, the shape of the sale price distribution varies visibly, which might be related to different mixes of property types, depending on the district and the healthcare network.

There is a clear ... relationship between the district healthcare entertainment and the sale price

There is no clear relationship between the district entertainment score and the sale price.

There is a clear decreasing relationship between the district community score and the sale price.

There is no clear relationship between the district education score and the sale price.

There is a clear increasing relationship between the district employment score and the sale price.

2.4.6. Descriptive statistics by geographical district

The following maps attempt to provide an overview of the listings available as well as the relationship between districts and house prices.

One can observe that the listings are not uniformly distributed across Toronto. With respect to housing price levels, both the detached and the condo markets seem to follow the district average income (Agincourt, the Beaches to Cliffcrest zone, Princess-Rosethorn), in line with the positive correlation between prices and district average income.

3. Modeling

The data available sets us in a context where we need to estimate the price of a property given a set of numerical, ordinal and categorical features that describe the property.

3.1. Imputation of missing values

Torsten and Robin, please input text here about how we did this

3.2. Linear regression

Torsten and Robin, please input text here to describe the model

Linear regression can accommodate both numerical and categorical explanatory variables (the categorical predictors are transformed into dummy variables).
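The dummy-variable transformation mentioned above can be sketched as follows. The helper name and the choice of baseline category are illustrative assumptions; regression software typically performs this step automatically.

```python
def dummy_encode(values, baseline):
    """Encode a categorical variable as 0/1 columns, one per non-baseline level.

    The baseline level is represented by a row of zeros, which avoids perfect
    collinearity with the intercept of the regression.
    """
    levels = sorted(set(values) - {baseline})
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Property types taken from the "type" feature, with a hypothetical baseline.
types = ["Condo Apt", "Detached", "Semi-Detached", "Condo Apt"]
levels, rows = dummy_encode(types, baseline="Condo Apt")
for property_type, row in zip(types, rows):
    print(property_type, dict(zip(levels, row)))
```

Each coefficient estimated on a dummy column is then read as the price difference relative to the baseline type, all else being equal.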

To understand and assess the potential effect of dimensionality reduction, we employed three methods:

  1. a linear regression which includes all explanatory variables without any dimensionality reduction method

  2. a stepwise model for explanatory variable selection based on the Akaike Information Criterion (depends on the model log likelihood and penalizes the addition of parameters)

  3. a lasso-ridge penalty approach, which adds to our likelihood function a term that penalizes large coefficients (their square value). As such, only high-impact variables that make it through the penalty are retained. To fine-tune the weight of the penalty function and ensure the result is robust to changes in the training data, we use 10-fold cross-validation.
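For reference, the two selection criteria above can be written in their standard textbook form. The notation is ours and is an assumption rather than the project's own: k is the number of estimated parameters, \hat{L} the maximized likelihood, \ell(\beta) the log likelihood, \lambda the penalty weight tuned by cross-validation, and \alpha a mixing weight between the lasso term (absolute values) and the ridge term (squared values). The text does not state which mix was used.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}

\hat{\beta} = \arg\min_{\beta}\;
  \Big\{ -\ell(\beta)
  + \lambda \Big[ \alpha \lVert \beta \rVert_1
  + \tfrac{1-\alpha}{2} \lVert \beta \rVert_2^2 \Big] \Big\}
```

With \alpha = 0 only the squared (ridge) term remains, which shrinks coefficients without setting them exactly to zero; with \alpha = 1 only the lasso term remains, which can remove variables entirely.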

3.2.1. Linear regression - model estimation and dimension reduction

Torsten and Robin, please input here the model final form and discussion of its features

3.2.2. Linear regression - model diagnostics

Torsten and Robin, please input here the model diagnostics results

3.3. Random Forest Model

Random Forest is an ensemble machine learning algorithm that generates multiple decision trees based on random sampling of the data and of the predictor variables, and then combines their output. For every explanatory variable a partition into classes is generated, and each decision tree is built from the membership in specific classes of a randomly selected set of variables. The user decides how many variables are used in constructing the decision tree. The construction of the classes for each variable accommodates both categorical and numerical variables (continuous or discrete).

The random sampling serves to de-correlate the trees, so that averaging them reduces the variance and helps avoid overfitting. The user decides how many trees are used for model averaging. The random forest model needs to be tuned with respect to the number of decision trees and the number of variables randomly sampled at each stage.

In our project, we used the random forest model as a comparison to the linear regression model with respect to the selection of explanatory variables, estimation errors and model fit.
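The two sources of randomness described above can be sketched as follows. This only shows the sampling steps, not the tree-growing itself; the feature list, the values of the two tuning parameters and the function names are illustrative assumptions.

```python
import random


def bootstrap_rows(n_rows, rng):
    """Sample row indices with replacement: the data each tree is grown on."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(features, n_candidates, rng):
    """Sample, without replacement, the variables considered at one stage."""
    return rng.sample(features, n_candidates)


rng = random.Random(42)  # fixed seed so the illustration is reproducible
features = ["bedrooms", "bathrooms", "sqft", "subway walking distance",
            "mean district income", "transit score"]
n_trees = 3        # tuning parameter: number of trees to average
n_candidates = 2   # tuning parameter: variables sampled at each stage

for tree in range(n_trees):
    rows = bootstrap_rows(10, rng)
    split_candidates = candidate_features(features, n_candidates, rng)
    print(tree, sorted(set(rows)), split_candidates)
```

In a full random forest the candidate variables are redrawn at every split of every tree, and the predictions of all trees are averaged to produce the final estimate.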

One can observe that the distance to subway, number of bathrooms and average district income occur across all random forest models.

For detached, district scores on community, education and transit appear in addition to the above.

For condos, district scores on shopping and health appear in addition to the above.

3.3.1. Random Forest - model estimation results

3.3.2. Random Forest - model diagnostics

4. Deployment

5. Future developments

5.1. Models

5.2. Shiny app